Susan Holmes (c)
July 3, 2017
## Loading required package: knitr
Birds of a Feather
Finding a latent or hidden variable present which we are not necessarily provided.
{width = “80%”}
Study which types of data can be beneficially clustered.
See measures of (dis)similarity and that help us define clusters.
Uncover hidden or latent clustering by partitioning the data into tighter sets.
Show how relevant clustering can be in immunology where immune cells can be naturally grouped into subpopulation.
Explain methods such as kmeans or kmedoids on multivariate cell populations.
We will explain the recursive approach to clustering that combines observations and groups into a hierarchy of sets and called hierarchical clustering.
The validation of clusters is a challenge that can be overcome by using resampling-based approaches using the bootstrap that we will demonstrate in some single-cell data.
data-label=“fig:cancerHC”
More homogeneous
A recent example of effective use of clustering is the realization that even a disease such as Breast Cancer does not have a single molecular signature but there are many Molecular type of Cancers.
The groups can then be used to provide more specific, optimized treatments.
FlowCytometry CD4/CD3
Cholera Map
LondonBombings
Breast Cancer
This is a case where we can benefit from putting the data in an augmented format: \((X,g)\) where \(g\) is the unknown cluster the data come from. The variable \(g\) is categorical, taking on as many different values as there are clusters.
Methods that add useful but unknown components to the data like this are called data augmentation methods
There are already thousands of existing clustering algorithms combining different choices for the basic distance used to compare observations.
We decompose these choices according to the following steps starting from an observation by feature matrix X, we have to choose a distance matrix from which to construct the clusters, hoping that they are well separated.
ClusteringWorkflow
Of a feather How are distances measured, or how is similarity between observations defined ?
Distances
Taxicab
If you know what is the relevant notion of `closeness’ or similarity for your data, you have (almost) solved the problem.
Observations are just sequences of 0’s and 1’s, presence of 1’s in the same spots does not present the same importance as that of the 0’s: Jaccard distance=1-JC.
\(S\) and \(T\), let’s call it \(f_{00}\). The Jaccard index is \[\begin{equation} J(S,T) = \frac{f_{11}}{f_{01}+f_{10}+f_{11}}, \end{equation}\] and the \[\begin{equation} d_J(S,T) = 1-J(S,T) = \frac{f_{01}+f_{10}}{f_{01}+f_{10}+f_{11}}. \end{equation}\]mut=read.csv("/Users/susan/Books/CUBook/data/HIVmutations.csv")
mut[1:3,10:16]## p32I p33F p34Q p35G p43T p46I p46L
## 1 0 1 0 0 0 0 0
## 2 0 1 0 0 0 1 0
## 3 0 1 0 0 0 0 0
library("vegan")
mutJ=vegdist(mut,"jaccard")
mutC=1-abs(cor(t(mut)))
round(mutJ,2)## 1 2 3 4
## 2 0.80
## 3 0.75 0.89
## 4 0.90 0.78 0.85
## 5 1.00 0.80 0.89 0.90
round(as.dist(mutC),2)## 1 2 3 4
## 2 0.70
## 3 0.61 0.85
## 4 0.87 0.64 0.84
## 5 0.94 0.70 0.85 0.87
require(vegan)
data(dune)
dist.dune=vegdist(dune)
symnum(as.matrix(dist.dune))## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1
## 2 .
## 3 . .
## 4 . .
## 5 , . . .
## 6 , . . ,
## 7 . . . .
## 8 , . . . , . .
## 9 . . . . . . . .
## 10 . . . . . . .
## 11 . . . . , . . . . .
## 12 * , . . , , , . . , ,
## 13 + . . . , , , . . , , .
## 14 1 , , , + + + . , , + , ,
## 15 1 * , , + + + . , + , , , .
## 16 * + , , + + + . , + + . , . .
## 17 + + + + , , , + + , , * + + + 1
## 18 , . , , . . . , , . . , , + , + ,
## 19 1 + + , , , , , , , . , + + , * . .
## 20 1 * , , + + + . , + + , , . . * , ,
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
%
These methods are known as non-hierarchical Clustering or iterative relocation methods. There are several initial choices to be made with these methods and when the a priori knowledge is not available this can be a drawback. The first is the number of clusters suspected. Each time the algorithm is run, initial `seeds’ for each cluster have to be provided, for different starting configurations the answers can be different.
The function kmeans is the one that can be used in R.
kmeans worksrequire(animation)
kmeans.ani(x = matrix(c(runif(100),runif(100)), ncol = 2, dimnames = list(NULL, c("X1", "X2"))), centers = 3, pch = 1:3,col = 1:3, hints = c("Move centers", "Find clusters"))#set.seed(95647745)
set.seed(952224887)
Xmat=matrix(runif(200),ncol=2)
nk=3
cents=matrix(runif(2*nk),ncol=2)
###Default distance=Euclidean
dist1=function(vec){dist(rbind(vec,cents[1,]))}
dist2=function(vec){dist(rbind(vec,cents[2,]))}
dist3=function(vec){dist(rbind(vec,cents[3,]))}
dists123=cbind(apply(Xmat,1,dist1),apply(Xmat,1,dist2),apply(Xmat,1,dist3))
clust0=apply(dists123,1,which.min)
out1=kmeans(Xmat,cents,iter.max=1)
out2=kmeans(Xmat,cents,iter.max=2)
out3=kmeans(Xmat,cents,iter.max=3)
plot(Xmat,type='n')
text(Xmat[,1],Xmat[,2],out1$cluster)
text(Xmat,col=c('red','blue','green')[out1$cluster],out1$cluster)require(ggplot2)
data0=data.frame(x=Xmat[,1],y=Xmat[,2],clust=as.factor(clust0))
data1=data.frame(x=Xmat[,1],y=Xmat[,2],clust=as.factor(out1$cluster))
###Extra points, centers of clusters
cdg=data.frame(x=cents[,1],y=cents[,2],clust="c")
pp=ggplot(data=data0,aes(x= x, y = y, colour = clust, shape = clust))+
scale_shape_discrete(solid=T, guide=F) +
geom_point(data=data0, mapping=aes(x=x, y=y, shape=clust,colour=clust), size=5)+
geom_point(data = cdg, colour = "black",size=5,shape=1)
ppcents=out1$centers
cdg=data.frame(x=cents[,1],y=cents[,2],clust="c")
pp=ggplot(data=data1,aes(x= x, y = y, colour = clust, shape = clust))+
scale_shape_discrete(solid=T, guide=F) +
geom_point(data=data1, mapping=aes(x=x, y=y, shape=clust,colour=clust), size=5)+
geom_point(data = cdg, colour = "black",size=5,shape=1)
ppA more evolved method called Dynamical Clusters is also interesting because it repeats the process many times and builds what is known as `strong forms’ which are groups of observations that end up in the same classes for most possible initial configurations.
Have to choose how many classes they will be prior to the analysis. Can depend on the initial seeds, so we may need to repeat the analyses over and over again.
Very similar to kmeans but the centers are observation points.
Then \(m_k = x_{i^*_k} , k = 1, 2,\ldots , K\) are the current estimates of the cluster centers.
Iterate steps 1 and 2 until the assignments do not change.
#####
km4 <- kmeans(simul4$values,4)
plot(simul4$values, col = km4$cluster, xlab = "", ylab = "", main = "kmeans")When the number of clusters is not known a priori, it is useful to perform hierarchical clustering because it provides a graphical summary that enables us to evaluate where the cutoff for choosing clusters should occur.
This class of methods starts with all the observations making their own cluster and agglomerates gradually the points together by similarity so the number of clusters decreases as points are clumped together into clusters, of course the user will have to make choices:
Like a mobile
Here are three representations of the same hierarchical clustering tree.
The Same Tree
Good for recognizing the number of clusters …. \(\ldots\) … But combs.
my.hclustfun=function(d){hclust(d, method = "single")}
my.distfun=function(x) {vegdist(x)}
heatmap.2(as.matrix(dune),hclustfun=my.hclustfun,distfun=my.distfun,col=color) This will separate the groups as much as possible, \[ D_{12}=\min_{i \in C_1, i \in C_2 } d_{ij}. \]
String objects together to form clusters, and the resulting clusters tend to represent long chains.'' orcombs’’.
image
Compact classes …. one observation can alter groups.
my.hclustfun=function(d){hclust(d, method = "complete")}
my.distfun=function(x) {vegdist(x)}
heatmap.2(as.matrix(dune),hclustfun=my.hclustfun,distfun=my.distfun,col=color) Furthest neighbor: the distances between clusters are determined by the greatest distance between any two objects in the different clusters. This method usually performs quite well in cases when the objects actually form naturally distinct clumps. If the clusters tend to be somehow elongated or of a chain type nature, then this method is inappropriate.
Classes have similar size and variance.
my.hclustfun=function(d){hclust(d, method = "aver")}
my.distfun=function(x) {vegdist(x)}
color = rev(heat.colors(256))
heatmap.2(as.matrix(dune),hclustfun=my.hclustfun,distfun=my.distfun,col=color) \[ D_{12}=\frac{1}{|C_1| |C_2|}\sum_{i \in C_1, i \in C_2 } d_{ij}\] Unweighted pair-group average. (Goldilocks method)
require(vegan)
hclust.aver <- hclust(dist.dune, method="aver")
plot(hclust.aver)More robust to outliers.
image
vegemite(dune, hclust.aver)##
## 111211 11 11 1
## 46507912334891856720
## Comapalu 2.2.................
## Callcusp 43.3................
## Eleopalu 4854.......4........
## Ranuflam 2224....2..2........
## Airaprae ....23..............
## Empenigr .....2..............
## Agrostol 4745...454843.......
## Juncarti .334.......44.......
## Salirepe ...5.3........3.....
## Hyporadi ....25.......2......
## Chenalbu ........1...........
## Alopgeni .4.....857253.....2.
## Sagiproc .....3.42.5222......
## Bracruta .444.3.4.222246262.2
## Cirsarve ..........2.........
## Juncbufo .......43...4....2..
## Scorautu 2.2226.2222325533353
## Elymrepe ......4..44.6..4..4.
## Trifrepe 6.1..2.3221233225256
## Anthodor ....44.........432.4
## Poatriv .2....2496545..64574
## Poaprat ....1.4.254444323444
## Lolipere ......7..65427226656
## Rumeacet .......2....2..563..
## Bellpere .........22...22..32
## Vicilath .............21....1
## Planlanc ....2........33555.3
## Achimill ....2.1........22234
## Trifprat ...............252..
## Bromhord ..........3....2.244
## sites species
## 20 30
my.hclustfun=function(d){hclust(d, method = "aver")}
my.distfun=function(x) {vegdist(x)}
color = rev(heat.colors(256))
heatmap.2(as.matrix(dune),hclustfun=my.hclustfun,distfun=my.distfun,col=color) Tree shapes differ:
require(vegan)
hclust.aver <- hclust(dist.dune, method="aver")
plot(hclust.aver)Hierarchical clustering is a bottom up approach where similar observations and subclasses are assembled iteratively.
image
The Figure above shows how Linnæus made nested clusters of organisms according to more specific characteristics.
This hierarchical organization has been useful in many fields and goes back to Aristotle who postulated a ladder of nature .
##############################################
hc = hclust(dist(simul4$values), method = "ward")## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
memb <- cutree(hc, k = 4)
plot(simul4$values, col = memb, xlab = "", ylab = "", main = "hclust Euclidean ward") ###############Compute the kmeans within group wss for k=1 to 9
wss = sum(apply(scale(simul4$value,scale=FALSE),2,function(x){x^2}))
for (k in 2:20) {
km4 <- kmeans(simul4$values,k)
wss = c(wss, sum(km4$withinss))
}
plot(1:20, wss, xlab='k', ylab='WSS(k)', type='l', lwd=2, col='red')plot(1:20, log(wss), xlab='k', ylab='logWSS(k)', type='l', lwd=2, col='red')plot(hclust(dist(simul4$values), method = "ward"))## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
For each observation i, the silhouette width s(i) is defined as follows: Put a(i) = average dissimilarity between i and all other points of the cluster to which i belongs (if i is the only observation in its cluster, s(i) := 0 without further calculations).
For all other clusters C, put d(i,C) = average dissimilarity of i to all observations of C.
The smallest of these d(i,C) is called \[ b(i) := min_C d(i,C) \] and can be seen as the dissimilarity between i and its “neighbor” cluster, i.e., the nearest one to which it does not belong. Finally, \[ s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ). \]
library(fpc)
library(cluster)
d=simul4$values
asw <- numeric(20)
for (k in 2:20)
asw[[k]] <- pam(d, k)$silinfo$avg.width
k.best <- which.max(asw)
#cat("Silhouette-optimal number of clusters:", k.best, "\n")
k.best## [1] 4
p4=pam(x=d,k=k.best)
si=silhouette(p4)
plot(si,col="red",main="Simulated Data")library(cluster)
asw <- numeric(20)
for (k in 2:18)
asw[[k]] <- pam(dune, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("Silhouette-optimal number of clusters:", k.best, "\n")## Silhouette-optimal number of clusters: 4
p4=pam(x=dune,k=k.best)
si=silhouette(p4)
plot(si,col="red",main="Dune Data")ccas <- cascadeKM(decostand(dune, "hell"), 2, 15)Within and between sums of squares:
Between and Within
W in black and B in Red.
Calinski and Harabasz, 1974: \[(n-k)*sum(\mbox{diag}(B))/((k-1)*sum(\mbox{diag}(W)))\qquad B \mbox{ between } W \mbox{ within}\]
plot(nclusters, type = "h", xlab = "k clusters", ylab = "CH index",
main = "Optimal number of clusters: Simulated Data")data.dist=dist.dune
data=dune
data.cluster = pam.clustering(data.dist, k = 3)
require(clusterSim)
nclusters = index.G1(data, data.cluster, d = data.dist, centrotypes = "medoids")
nclusters = NULL
for (k in 1:19) {
if (k == 1) {
nclusters[k] = NA
} else {
data.cluster_temp = pam.clustering(data.dist, k)
nclusters[k] = index.G1(data, data.cluster_temp, d = data.dist, centrotypes = "medoids")
}
}
plot(nclusters, type = "h", xlab = "k clusters", ylab = "CH index",
main = "Optimal number of clusters: Dune Data")#load("http://bios221.stanford.edu/data/Msig3transp.RData")
load("/Users/susan/Dropbox/CaseStudies/Msig3transp.RData")
#####You'll need to load the data from the data folder to your computer
#####and then load it
dim(Msig3transp)## [1] 30 156
head(Msig3transp[,1:10])## X3968 X14831 X13492 X5108 X16348
## HEA26_EFFE_1 -2.6108361 -1.1857923 -0.05612926 -0.1470299 0.5230758
## HEA26_MEM_1 -2.2592865 -0.4718562 0.27730991 0.5438521 -0.3659554
## HEA26_NAI_1 -0.2719081 0.8170308 0.81269167 0.7198425 -0.9049818
## MEL36_EFFE_1 -2.2431022 -1.0806009 -0.23557156 -0.1760595 0.6406782
## MEL36_MEM_1 -2.6798807 -0.1463671 0.24909402 0.9519808 -0.2040837
## MEL36_NAI_1 -0.4685057 1.0693272 0.75844932 0.7898928 -0.7998404
## X585 X11495 X17522 X12908 X17992
## HEA26_EFFE_1 -0.01816524 -0.8553673 1.27731835 1.716015 0.4826907
## HEA26_MEM_1 0.10784211 -1.0245616 0.64556610 3.618092 1.2100542
## HEA26_NAI_1 0.74819063 -1.7309336 0.01931762 3.924107 1.7128855
## MEL36_EFFE_1 0.01000666 -0.9778569 1.35571272 1.418629 -0.4003710
## MEL36_MEM_1 0.16530245 -1.3276209 0.56192130 3.858240 0.7366370
## MEL36_NAI_1 0.92653015 -1.9024371 -0.05866747 3.941283 1.4401721
dr=round(dist(Msig3transp),2)
heatmap(as.matrix(Msig3transp),labCol = "")corM=cor(as.matrix(Msig3transp))
heatmap(corM,distfun=function(c)as.dist(1-c),labCol = "",labRow="")corT=cor(as.matrix(t(Msig3transp)))
heatmap(corT,distfun=function(c)as.dist(1-c),labCol ="")load("/Users/susan/Dropbox/CaseStudies/Msigbinary.save")
head(disjonct)## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] -1 -1 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1
## [2,] -1 -1 1 -1 0 1 -1 -1 0 -1 -1 0 -1
## [3,] 0 0 1 0 0 1 0 0 1 0 0 0 0
## [4,] 0 1 1 0 1 1 0 1 1 0 1 0 0
## [5,] 1 -1 -1 1 0 -1 0 -1 -1 1 0 0 1
## [6,] 0 0 1 0 0 1 0 0 1 0 0 1 0
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] -1 0 -1 -1 0 -1 -1 0 -1 -1 0
## [2,] 0 1 -1 -1 1 -1 -1 1 -1 -1 0
## [3,] 0 1 0 0 1 0 0 1 0 0 1
## [4,] 1 1 0 1 1 0 1 1 0 1 1
## [5,] 0 -1 1 -1 -1 0 -1 -1 1 0 -1
## [6,] 0 1 1 0 1 0 0 1 0 0 1
## [,25] [,26] [,27] [,28] [,29] [,30]
## [1,] -1 -1 0 -1 -1 0
## [2,] -1 0 1 -1 0 1
## [3,] 0 0 1 0 0 1
## [4,] -1 0 0 0 0 0
## [5,] 0 -1 -1 1 0 -1
## [6,] 0 0 1 0 0 1
heatmap(as.matrix(disjonct),distfun=function(c)dist(c,"manhattan"),labCol = "")library("dplyr")
simul4=lapply(c(0,8),function(x){
lapply(c(0,8),function(y){
data.frame(x=rnorm(100,x,2),y=rnorm(100,y,2),class=paste(x,y,sep=""))
}) %>% do.call(rbind,.)
}) %>% do.call(rbind,.)Let’s now use the method on a real example: we will reload the Hiiragi data we explored in Lecture 3 [@Ohnishi2014].
We want to investigate the level of clustering of in particular to investigate the hypothesis that some of cells in embyronic day 3.5 ( E3.5 ) falls ‘naturally’ into two clusters corresponding to while the data for embryonic day 3.25 ( E3.25 ) do not.
library("cluster")
library("Hiiragi2013")
data("x")###Pick the top 50 genes most variable in wt
ngenes=50
WT=x[,x$genotype=="WT"]
selFeats = order(rowVars(exprs(WT)), decreasing = TRUE)[seq_len(ngenes)]
submat=exprs(x[selFeats,])
rownames(submat)=fData(x)[selFeats, "symbol"]
z=t(submat)
resGAP=clusGap(z,FUN=pam1,K.max=10,verbose=FALSE)In fact the default choice of resGAP ( globalSEmax : first value for which the gap(k) is not larger than the global maximum within a standard error \(s\)) gives k=8, whereas the choice recommended in @gap2001 (the smallest k such that \(gap(k) \geq gap(k+1) - s_{k+1}\)) gives k=4.
plot(resGAP)pam8=pam(z, k=8, cluster.only=TRUE)
pam4=pam(z, k=4, cluster.only=TRUE)
###Using the biological labels to understand the groups
table(pam8,pData(x)[names(pam8), "sampleGroup"])##
## pam8 E3.25 E3.5EPI E3.5KO E3.5PE E325KO E4.5EPI E4.5KO E4.5PE
## 1 4 0 0 0 15 0 0 0
## 2 20 2 0 0 0 0 0 0
## 3 6 0 8 0 2 0 0 0
## 4 5 8 0 1 0 0 0 0
## 5 1 1 0 0 0 0 10 0
## 6 0 0 0 10 0 0 0 0
## 7 0 0 0 0 0 0 0 4
## 8 0 0 0 0 0 4 0 0
table(pam4,pData(x)[names(pam4), "sampleGroup"])##
## pam4 E3.25 E3.5EPI E3.5KO E3.5PE E325KO E4.5EPI E4.5KO E4.5PE
## 1 5 0 6 0 17 0 0 0
## 2 29 7 0 0 0 0 0 0
## 3 1 1 0 11 0 0 0 4
## 4 1 3 2 0 0 4 10 0
One of the most important parts of clustering is the projection supplementary biological information onto the clusters as we have done in the tables above.
BootstrapCluster
We saw the bootstrap principle in Lecture ; in a perfect world, we would use many samples of data from the same distribution and generate a clustering for each data set and then see how the clusterings change using an index such as those we used above to compare clusterings.
However, in fact we are going to create new data sets simply by taking subsamples of the data and looking at the different clusterings we get each time and comparing them. @gap2001 recommend using bootstrap resampling to infer the number of clusters using the gap statistic.
BootstrapCluster
Hiiragi data [@Ohnishi2014].
Investigation of the hypothesis that the inner cell mass (ICM) of the mouse blastocyst in embyronic day 3.5 ( E3.5 ) falls ‘naturally’ into two clusters corresponding to pluripotent epiblast (EPI) versus primitive endoderm (PE) while the data for embryonic day 3.25 ( E3.25 ) do not.
We will not use the true group labels in the clustering step and only use them in the final interpretation of the results. We will use a bootstrap applied to the two different data sets ( E3.5 ) and ( E3.25 ) separately. Each step of the bootstrap will generate a clustering of a random subset of the data and we will need to compare these through a consensus of an ensemble of clusters. There is a useful framework for this in the clue package.
clusterResampling = function(x, ngenes, k = 2, B = 250, prob = 0.67) {
mat = exprs(x)
ce = cl_ensemble(list = lapply(seq_len(B), function(b) {
selSamps = sample(ncol(mat),size =round(prob*ncol(mat)),replace = F)
submat = mat[, selSamps, drop = FALSE]
selFeats = order(rowVars(submat), decreasing =TRUE)[seq_len(ngenes)]
submat = submat[selFeats,, drop = FALSE]
pamres = pam(t(submat), k = k)
pred = cl_predict(pamres, t(mat[selFeats, ]), "memberships")
as.cl_partition(pred)
}))
cons = cl_consensus(ce)
ag = sapply(ce, cl_agreement, y = cons)
return(list(agreements = ag, consensus = cons))
}To summarize, the function clusterResampling performs the following steps:
Draw a random subset of the data (the data are either all E3.25 or all E3.5 samples) by selecting 67% of the samples without replacement.
Select the top ngenes (see below) features by overall variance (in the subset) .
Apply \(k\)-means clustering, and predict the cluster memberships of the samples that were not in the subset with the cl_predict method, through their proximity to the cluster centres.
Repeat steps 1-3 for B \(=250\) times.
Apply consensus clustering ( cl_consensus ).
For each of the B \(=250\) clusterings, measure the agreement with the consensus ( cl_agreement ); here, good agreement is indicated by a value of 1, and less agreement by smaller values. If the agreement is generally high, then the clustering into \(k\) classes can be considered stable and reproducible; inversely, if it is low, then no stable partition of the samples into \(k\) clusters is evident.
As a measure of between-cluster distance for the consensus clustering, the Euclidean dissimilarity of the memberships is used, i. e., the square root of the minimal sum of the squared differences of \(\mathbf{u}\) and all column permutations of \(\mathbf{v}\), where \(\mathbf{u}\) and \(\mathbf{v}\) are the cluster membership matrices. As agreement measure for step 6, the quantity \(1 - d / m\) is used, where \(d\) is the Euclidean dissimilarity, and \(m\) is an upper bound for the maximal Euclidean dissimilarity.
E325=clusterResampling(x[, unlist(groups[c("E3.25")])], ngenes = 20)
E35=clusterResampling(x[, unlist(groups[c("E3.5 (EPI)", "E3.5 (PE)")])],ngenes = 20)
ce = list( "E3.25" = E325,"E3.5" = E35)Cluster stability analysis with E3.25 and E3.5 (WT:wild type) samples. (Left: boxplot of the cluster agreements with the consensus, for B =250 Clusterings; 1 indicates perfect agreement, and the value decreases with worse agreement. Right: membership probabilities of the consensus clustering; colors are as in the left panel. For E3.25, the probabilities are diffuse, indicating that the individual (resampled) clusterings disagree a lot, whereas for E3.5, the distribution is bimodal, with only one ambiguous sample.)
colours = c(sampleColourMap["E3.25"], brewer.pal(9,"Set1")[9])
boxplot(lapply(ce, `[[`, "agreements"), ylab = "agreement probabilities", col = colours)Results from Clue
The results are shown in Figures above and below. They confirm the hypothesis that the E.35 data fall into two clusters.
mems = lapply(ce, function(x) sort(cl_membership(x$consensus)[, 1]))
mgrp = lapply(seq(along = mems), function(i) rep(i, times = length(mems[[i]])))
myjitter = function(x) x+seq(-.4, +.4, length.out = length(x))
plot(unlist(lapply(mgrp, myjitter)), unlist(mems),
col = colours[unlist(mgrp)], ylab = "membership probabilities",
xlab = "consensus clustering", xaxt = "n", pch = 16)
text(x = 1:2, y = par("usr")[3], labels = c("E3.25","E3.5"), adj = c(0.5, 1.4), xpd = NA)Cluster stability analysis with E3.25 and E3.5 (WT:wild type) samples. (Left: boxplot of the cluster agreements with the consensus, for B =250 Clusterings; 1 indicates perfect agreement, and the value decreases with worse agreement. Right: membership probabilities of the consensus clustering; colors are as in the left panel. For E3.25, the probabilities are diffuse, indicating that the individual (resampled) clusterings disagree a lot, whereas for E3.5, the distribution is bimodal, with only one ambiguous sample.)
Suppose that we have a bivariate distribution of observations made with the same error variances, however the sampling is done from two groups which have very different baseline frequencies.
Suppose, as an example, that the errors are continuous independent bivariate normally distributed. We have \(10^{3}\) of seq1 and \(10^{5}\) of seq2 , as generated for instance by the code:
library("mixtools")## mixtools package, version 1.1.0, Released 2017-03-10
## This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.
##
## Attaching package: 'mixtools'
## The following object is masked from 'package:gtools':
##
## ddirichlet
library("ggplot2")
seq1 = rmvnorm(n = 1e3, mu = -c(1, 1), sigma = 0.5*diag(c(1, 1)))
seq2 = rmvnorm(n = 1e5, mu = c(1, 1), sigma = 0.5*diag(c(1, 1)))
twogr = data.frame(
rbind(seq1, seq2),
seq = factor(c(rep(1, nrow(seq1)),
rep(2, nrow(seq2))))
)
colnames(twogr)[1:2] = c("x", "y")
ggplot(twogr, aes(x = x, y = y, colour = seq,fill = seq)) +
geom_hex(alpha = 0.5, bins = 50) + coord_fixed()Unequal size clusters
The observed values would look as in Figure @ref(fig:seq:radius).
Question Take the data seq1 and seq2 and assign them to groups according to a hard distance threshold with balls of radius.
image
See @kahneman2011 for a book-length treatment of our natural heuristics and the ways in which they can mislead us when we make probability calculations (we recommend especially chapters 14 and 15).
Simulate n reads of length len , when error probability is perr , and all we care about whether a base was called correctly ( TRUE ) or not ( FALSE ).
n = 5e3
len = 200
perr = 0.001
seqs = matrix(runif(n * len) >= perr, nrow = n, ncol = len)Now, compute all pairwise distances between reads
dists = as.matrix(dist(seqs, method = "manhattan"))% For various values of number of reads k (from 10 to n), the maximum distance within this set of reads. %
library("tibble")
dfseqs = tibble(
k = 10 ^ seq(1, log10(n), length.out = 20),
diameter = vapply(k, function(i) {
s = sample(n, i)
max(dists[s, s])
}, numeric(1)))
ggplot(dfseqs, aes(x = k, y = diameter)) +
geom_point() + geom_smooth()## `geom_smooth()` using method = 'loess'
We will now use reference probabilities to improve and denoise 16SrRNA-read clustering.
There are so-called variable regions of the 16SrRNA gene in bacteria that are taxa specific: fingerprint that enables taxon [^4] identification.
FASTQ-files \(\longrightarrow\) Iterative alternating \(\longrightarrow\) probabilistic noise de novo \(\longrightarrow\)(RSV).
We start by loading the dada2 that provides all the function to denoise these data using de novo clustering:
library(dada2)## Loading required package: Rcpp
The DADA method (Divisive Amplicon Denoising Algorithm)[@Rosen:2012] relies on a parameterized model of substitution errors that distinguishes sequencing errors from real biological variation. The model computes the probabilities of seeing an \({\sf A}\) instead of a \({\sf C}\) regardless of its position along the sequence. Because error rates vary substantially between sequencing runs and PCR protocols, the model parameters are estimated from the data themselves using an EM-type approach. A read is classified as noisy or exact given the current parameters, then the noise model parameters are updated accordingly[^7].
The dereplicated sequences are read in and then divisive denoising and estimation is run with the dada function as in the following code:
# Download "http://bios221.stanford.edu/data/derepFs.rds"
# Download "http://bios221.stanford.edu/data/derepRs.rds"
derepFs=readRDS("/Users/susan/Books/CUBook/data/derepFs.rds")
derepRs=readRDS("/Users/susan/Books/CUBook/data/derepRs.rds")
###Intermediary, only run the first time
ddF = dada(derepFs, err = NULL, selfConsist = TRUE)
ddR = dada(derepRs, err = NULL, selfConsist = TRUE)Fit between the observed error rates (black points) and the fitted error rates (black lines).
plotErrors(ddF)Forward transition error rates as provided by plotErrors(ddF) , this shows the frequencies of each type of nucleotide transition as a function of quality.
Once the errors have been estimated, the algorithm is rerun on the data to find the sequence variants:
dadaFs = dada(derepFs, err=ddF[[1]]$err_out, pool = TRUE)
dadaRs = dada(derepRs, err=ddR[[1]]$err_out, pool = TRUE)mergers = mergePairs(dadaFs, derepFs, dadaRs, derepRs)Contingency table the common Ribosomal Strain Variants (RSV):
seqtab.all = makeSequenceTable(mergers[!grepl("Mock", names(mergers))])Chimera are sequences that are artificially created during PCR amplification by the melding of different, usually only two, of the original sequences together. To complete our denoising workflow, we remove them with a call to the function removeBimeraDenovo , leaving us with a clean contingency table we will use later on.
seqtab = removeBimeraDenovo(seqtab.all)## As of the 1.4 release, the default method changed to consensus (from pooled).
outliers is possible but be honest: record and explain.We saw at the start of the Lecture how finding the right distance is an essential first step in a clustering analysis, this is a case when the garbage in, garbage out motto is in full force. Always choose a distance which is scientifically meaningful and compare output from as many distances as possible; sometimes the same data require different distances when different scientific objectives are pursued.
We saw two different types of clustering approaches: iterative partitionning approaches such as kmeans and kmedoids (PAM) that alternated between estimating the clusters and assigning points to them and hierarchical clustering approaches that agglomerate points and then small clusters into larger ones in a nested sequence of sets that can be represented by hierarchical clustering trees.
We saw examples of using clustering for single cell measurements and how to supplement standard with density-based clustering that works quite well for lower dimensioanl data where sparsity is not an issue.
Clustering algorithms always deliver clusters so we need to assess their quality and the number of clusters to choose carefully.
These validation steps are done using visualization tools and repeating the clustering on many resamples of the data. We saw how statistics such as the wss/bss or \(\log(\mbox{\tt wss})\) can calibrated using simulation on data where we understand the group structure and can provides useful benchmarks for choosing the number of cluster.
Of course, the use of biologically relevant information to inform and confirm the meaning of clusters is always the best validation approach.
Finally we saw that distances are not everything, taking into account baseline frequencies and local densities when clustering can be important; we showed this in a case study of how to use clustering to denoise 16s rRNA sequence reads.
house=read.table("/Users/susan/Dropbox/CaseStudies/votes.txt")
head(house[,1:10])## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 -0.5 -0.5 0.5 -0.5 0.0 0.5 0.5 0.5 0.5 0.5
## 2 -0.5 -0.5 0.5 -0.5 0.0 0.5 0.5 0.5 0.5 0.5
## 3 0.5 0.5 -0.5 0.5 -0.5 0.5 0.5 -0.5 -0.5 -0.5
## 4 0.5 0.5 -0.5 0.5 -0.5 0.5 0.5 -0.5 -0.5 -0.5
## 5 0.5 0.5 -0.5 0.5 -0.5 0.5 0.5 -0.5 -0.5 -0.5
## 6 -0.5 -0.5 0.5 -0.5 0.0 0.5 0.5 0.5 0.5 0.5
party=scan("/Users/susan/Dropbox/CaseStudies/party.txt")
#table(party)## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:permute':
##
## how
require(scatterplot3d)## Loading required package: scatterplot3d
house.cmds=cmdscale(disthouse,k=20,eig=TRUE)
round(house.cmds$eig[1:20],1)## [1] 63.4 1.8 1.6 1.0 0.6 0.6 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.3
## [15] 0.3 0.3 0.3 0.2 0.2 0.2
scatterplot3d(house.cmds$points[,1:3])scatterplot3d(house.cmds$points[,1:3],color=mycols[party+1])house.kpca=kpca(house.K)
scatterplot3d(rotated(house.kpca)[,1:3])house.kpca=kpca(house.K)
scatterplot3d(rotated(house.kpca)[,1:3],color=mycols[party+1])library(rgl)
plot3d(rotated(house.kpca),col=mycols[party+1])